{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "DataAnalysis"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tools import *\n",
    "import re\n",
    "%matplotlib inline"
   ]
  },
  {
   "source": [
    "# Chap 3 处理原始文本\n",
    "1.  如何访问文件内的文本？\n",
    "2.  如何将文档分割成单独的单词和标点符号，从而进行文本语料上的分析？\n",
    "3.  如何产生格式化的输出，并把结果保存在文件中？"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 3.4 使用正则表达式检测词组搭配\n",
    "（本书可以帮助你快速了解正则表达式）"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >取出所有的小写字母拼写的单词< ---------------\n",
      "['a', 'aa', 'aal', 'aalii', 'aam', 'aardvark', 'aardwolf', 'aba', 'abac', 'abaca', 'abacate', 'abacay', 'abacinate']\n"
     ]
    }
   ],
   "source": [
    "show_subtitle(\"取出所有的小写字母拼写的单词\")\n",
    "wordlist = [\n",
    "        w\n",
    "        for w in nltk.corpus.words.words('en')\n",
    "        if w.islower()\n",
    "]\n",
    "print(wordlist[:13])"
   ]
  },
  {
   "source": [
    "### 3.4.1 使用基本的元字符"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >搜索以'ed'结尾的单词< ---------------\n",
      "['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', 'abridged', 'abscessed', 'absconded', 'absorbed', 'abstracted', 'abstricted']\n"
     ]
    }
   ],
   "source": [
    "show_subtitle(\"搜索以'ed'结尾的单词\")\n",
    "word_ed_list = [\n",
    "        w\n",
    "        for w in wordlist\n",
    "        if re.search('ed$', w)\n",
    "]\n",
    "print(word_ed_list[:13])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >搜索以'**j**t**'形式的单词，'^'表示单词的开头，'$'表示单词的结尾< ---------------\n",
      "['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']\n"
     ]
    }
   ],
   "source": [
    "#通配符「.」用于匹配任何单个字符\n",
    "show_subtitle(\"搜索以'**j**t**'形式的单词，'^'表示单词的开头，'$'表示单词的结尾\")\n",
    "word_jt_list = [\n",
    "        w\n",
    "        for w in wordlist\n",
    "        if re.search('^..j..t..$', w)\n",
    "]\n",
    "print(word_jt_list[:13])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >ab< ---------------\n",
      "sum_word_list=  16\n",
      "['aba', 'abac', 'abaca', 'aback', 'abas', 'abb', 'abu', 'aby']\n"
     ]
    }
   ],
   "source": [
    "# 通配符「?」表示前面的字符有 0 个或者 1个\n",
    "# 通配符「*」表示前面的字符有 0 个或者 多个\n",
    "show_subtitle(\"ab\")\n",
    "sum_word_list=sum(\n",
    "    1\n",
    "    for w in wordlist\n",
    "    if re.search('^aba?c.*$', w)\n",
    ")\n",
    "word_abc_list=[\n",
    "    w\n",
    "    for w in wordlist\n",
    "    if re.search('^aba?c*.$',w)\n",
    "]\n",
    "print(\"sum_word_list= \",sum_word_list)\n",
    "print(word_abc_list)"
   ]
  },
  {
   "source": [
    "### 3.4.2 范围与闭包"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >[ghi]表示三个字母中任意一个< ---------------\n",
      "['gold', 'golf', 'hold', 'hole']\n"
     ]
    }
   ],
   "source": [
    "show_subtitle(\"[ghi]表示三个字母中任意一个\")\n",
    "word_ghi_list = [\n",
    "        w\n",
    "        for w in wordlist\n",
    "        if re.search('^[ghi][mno][jlk][def]$', w)\n",
    "]\n",
    "print(word_ghi_list[:13])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "['', '!', '!!', '!!!', '!!!!', '!!!!!', '!!!!!!', '!!!!!!!', '!!!!!!!!', '!!!!!!!!!', '!!!!!!!!!!', '!!!!!!!!!!!', '!!!!!!!!!!!!!']\n"
     ]
    }
   ],
   "source": [
    "chat_words = sorted(set(\n",
    "        w\n",
    "        for w in nltk.corpus.nps_chat.words()\n",
    "))\n",
    "print(chat_words[:13])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >'+'表示一个或者多个< ---------------\n['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']\n"
     ]
    }
   ],
   "source": [
    "show_subtitle(\"'+'表示一个或者多个\")\n",
    "word_plus_list = [\n",
    "        w\n",
    "        for w in chat_words\n",
    "        if re.search('^m+i+n+e+$', w)\n",
    "]\n",
    "print(word_plus_list[:13])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >'*'表示零个或者多个< ---------------\n['', 'e', 'i', 'in', 'm', 'me', 'meeeeeeeeeeeee', 'mi', 'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'min', 'mine', 'mm']\n"
     ]
    }
   ],
   "source": [
    "show_subtitle(\"'*'表示零个或者多个\")\n",
    "word_star_list = [\n",
    "        w\n",
    "        for w in chat_words\n",
    "        if re.search('^m*i*n*e*$', w)\n",
    "]\n",
    "print(word_star_list[:13])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >'^'表示没有这些字符< ---------------\n['!', '!!', '!!!', '!!!!', '!!!!!', '!!!!!!', '!!!!!!!', '!!!!!!!!', '!!!!!!!!!', '!!!!!!!!!!', '!!!!!!!!!!!', '!!!!!!!!!!!!!', '!!!!!!!!!!!!!!!!']\n"
     ]
    }
   ],
   "source": [
    "# [^aeiouAEIOU]：表示没有这些字母的单词，即没有元音字母的单词，就是标点符号\n",
    "show_subtitle(\"'^'表示没有这些字符\")\n",
    "word_hat_list = [\n",
    "        w\n",
    "        for w in chat_words\n",
    "        if re.search('[^aeiouAEIOU]', w)\n",
    "]\n",
    "print(word_hat_list[:13])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >对比 * + ? 三个的区别< ---------------\n*=  481\n+=  481\n?=  234\n"
     ]
    }
   ],
   "source": [
    "wsj = sorted(set(nltk.corpus.treebank.words()))\n",
    "# 前面两个是一样的，因为小数肯定都会有整数在前面，而第三个不一样，是因为'?'表示零个或者一个，不包括大于10的整数\n",
    "show_subtitle(\"对比 * + ? 三个的区别\")\n",
    "print(\"*= \", len([w for w in wsj if re.search(\"^[0-9]*\\.[0-9]+$\", w)]))\n",
    "print(\"+= \", len([w for w in wsj if re.search(\"^[0-9]+\\.[0-9]+$\", w)]))\n",
    "print(\"?= \", len([w for w in wsj if re.search(\"^[0-9]?\\.[0-9]+$\", w)]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >'\\' 斜杠的作用< ---------------\n['C$', 'US$']\n"
     ]
    }
   ],
   "source": [
    "show_subtitle(\"'\\\\' 斜杠的作用\")\n",
    "word_slash_list = [\n",
    "        w\n",
    "        for w in wsj\n",
    "        if re.search(\"^[A-Z]+\\$$\", w)\n",
    "]\n",
    "print(word_slash_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >四位数的整数< ---------------\n['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', '1934', '1948', '1953', '1955', '1956', '1961', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1975', '1976', '1977', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2005', '2009', '2017', '2019', '2029', '3057', '8300']\n"
     ]
    }
   ],
   "source": [
    "show_subtitle(\"四位数的整数\")\n",
    "word_four_list = [\n",
    "        w\n",
    "        for w in wsj\n",
    "        if re.search('^[0-9]{4}$', w)\n",
    "]\n",
    "print(word_four_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >1位以上整数-3~5位长的单词< ---------------\n['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', '14-hour', '15-day', '150-point', '190-point', '20-point', '20-stock', '21-month', '237-seat', '240-page', '27-year', '30-day', '30-point', '30-share', '30-year', '300-day', '36-day', '36-store', '42-year', '50-state', '500-stock', '52-week', '69-point', '84-month', '87-store', '90-day']\n"
     ]
    }
   ],
   "source": [
    "show_subtitle(\"1位以上整数-3~5位长的单词\")\n",
    "word_num_word_list = [\n",
    "        w\n",
    "        for w in wsj\n",
    "        if re.search('^[0-9]+-[a-z]{3,5}$', w)\n",
    "]\n",
    "print(word_num_word_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >5位以上长的单词-2~3位长的单词-6位以下长的单词< ---------------\n['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting', 'savings-and-loan']\n"
     ]
    }
   ],
   "source": [
    "show_subtitle(\"5位以上长的单词-2~3位长的单词-6位以下长的单词\")\n",
    "word_word_word_list = [\n",
    "        w\n",
    "        for w in wsj\n",
    "        if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)\n",
    "]\n",
    "print(word_word_word_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >寻找分词< ---------------\n",
      "['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced', 'Advancing', 'Alfred', 'Allied', 'Annualized', 'Anything', 'Arbitrage-related', 'Arbitraging', 'Asked']\n"
     ]
    }
   ],
   "source": [
    "show_subtitle(\"寻找分词\")\n",
    "word_participle_list = [\n",
    "        w\n",
    "        for w in wsj\n",
    "        if re.search('(ed|ing)$', w)\n",
    "]\n",
    "print(word_participle_list[:13])"
   ]
  },
  {
   "source": [
    "表3-3 正则表达式的基本元字符，其中包括通配符、范围和闭包 P109\n",
    "\n",
    "原始字符串（raw string）：给字符串加一个前缀“r”表明后面的字符串是原始字符串。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ]
}